A biological ocean data reformatting effort

Biological ocean data collected from ships are reused in aggregations of historical data. These data are heavily relied upon to document long-term change and to validate satellite algorithms for ocean biology, and they are useful in assessing the performance of autonomous platforms and biogeochemical models. Existing aggregate products have largely been restricted to the surface ocean, omit physical data or have limited biological data. We present the first version of a BIOlogical ocean data reforMATting Effort (BIO-MATE) to begin to fill a gap in subsurface bio-physical data aggregates in a reproducible way. BIO-MATE uses open-source R software that reformats openly sourced published datasets from oceanographic voyages. These reformatted biological and physical data from underway sensors, profiling sensors, pigment analysis and particulate organic carbon analysis are stored in an interoperable BIO-MATE data product for easy access and use. Specific QA/QC protocols can now be easily applied to the BIO-MATE data product to support a variety of surface and subsurface applications.


Background & Summary
Marine phytoplankton blooms support ocean food-webs and influence global climate through the biological carbon pump [1][2][3]. Ocean physics and other environmental drivers control the timing, magnitude and extent of phytoplankton blooms through complex bio-physical relationships [4][5][6][7]. To study these relationships, integrated data structures that link biological and physical ocean data are needed. Ship-based data are the gold standard for accurate biological oceanographic measurements [8]. These data are often published separately to physical ocean data, stored across different repositories and in multiple formats. This makes it difficult and time-consuming to aggregate and link biological and physical data. The described data product attempts to make this task easier.
The biological ocean data reformatting effort (BIO-MATE) works to link existing, open-access biological and physical datasets across oceanographic voyages and promote their re-use (Fig. 1). This has been done by developing a BIO-MATE R software package that not only reformats published datasets, but also cross-references between biological and physical data and allows access to citation information (https://github.com/KimBaldry/BIOMATE-Rpackage). The resulting BIO-MATE data product allows users to easily access, manipulate and cite published ship-based datasets of different dimensions for multiple applications.
The BIO-MATE data product can be accessed via the IMAS Data Portal [9] and the Australian Ocean Data Network (https://portal.aodn.org.au/). The aggregation includes four data streams: (1) data collected from shipboard underway sensors, (2) profiling sensors mounted on sampling rosettes, (3) lab analysis for phytoplankton pigments and (4) lab analysis for particulate organic carbon (POC). These data streams are cross-referenced by unique expedition codes (EXPOCODE) and profiling station identifications (CTD_ID). An additional data stream contains supporting information for the data product including a list of oceanographic voyages, investigator contact information and data citations for reformatted datasets. We have also included an aggregated data table for biological data. Users are requested to refer to supporting data and cite all data products accessed through BIO-MATE, as well as the BIO-MATE data product itself. We consulted the distribution licenses of all data sources to ensure that with this condition data are re-used lawfully.
The data product has been used to understand how the response of in-situ fluorometers changes in the Southern Ocean, to assess non-photochemical quenching corrections and to investigate the role of ocean physics in mediating subsurface chlorophyll features [10]. These examples highlight the malleability of this data product to improve our understanding of biological oceanography. Other potential applications include validating satellite observations [11,12], developing new ways to validate in-situ bio-optical observations collected by autonomous

Semi-automated BIO-MATE workflow for reformatting datasets.
A semi-automated workflow and the BIO-MATE R software (https://github.com/KimBaldry/BIOMATE-Rpackage) were used to reformat published datasets and produce the BIO-MATE data product (Fig. 3). Downloaded data files were split by EXPOCODE if they formed part of a larger, multi-voyage dataset (e.g. PAL-LTER data records). Files for the profiling sensor data stream were further split into individual profiles. Processing metadata were manually entered into a table to inform the BIO-MATE R software, and a bulk run of the software was performed to reformat data files. The workflow is described in more detail in the following subsections (Fig. 2).

Download of published datasets.
Published datasets were manually downloaded from open-access repositories and stored locally in accordance with data policies. A small portion of the downloaded data, from old datasets, had to be manually reformatted before the reformatting scripts were applied, due to formatting irregularities. Downloaded data files, and their amendments used to create the BIO-MATE data product, are not published in BIO-MATE, but are available upon request to the corresponding author.

Splitting large datasets with BIO-MATE software.
The BIO-MATE R software requires each file to only contain observations from a single voyage. Further, the profiling sensor data stream requires each file to only contain observations from a single profiling cast, held in a discrete directory for each voyage.
The split_delim_file function splits files using identified variables containing EXPOCODE synonyms and/or profiling station information. This function can be used to split a single, large data file into smaller files as required. For this version of the data product, a number of files had to be split to be ingested into the BIO-MATE core functions. A record of these can be found on GitHub in the project notebook (https://github.com/KimBaldry/BIO-MATE/blob/main/BIO-MATE.Rmd).
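The splitting step can be illustrated with a minimal Python sketch. It is not the split_delim_file implementation from the BIO-MATE R package; it only shows the general idea of grouping the rows of one delimited table by a voyage identifier. The function names, the column name `cruise` and the sample values are hypothetical.

```python
import csv
import io
from collections import defaultdict


def split_rows_by_key(text, key_column):
    """Group the rows of a comma delimited table by the value in key_column."""
    reader = csv.DictReader(io.StringIO(text))
    groups = defaultdict(list)
    for row in reader:
        groups[row[key_column]].append(row)
    return reader.fieldnames, dict(groups)


def to_csv(fieldnames, rows):
    """Serialise one group of rows back to comma delimited text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical multi-voyage table; codes and values are made up.
sample = """cruise,station,chl
CRUISE_A,1,0.31
CRUISE_B,1,0.12
CRUISE_A,2,0.45
"""

fieldnames, groups = split_rows_by_key(sample, "cruise")
# One output "file" per voyage, held in memory here instead of on disk.
split_files = {f"{code}.csv": to_csv(fieldnames, rows) for code, rows in groups.items()}
print(sorted(split_files))
```

Splitting the profiling sensor stream further into single casts would follow the same pattern, with a station and cast identifier as the grouping key.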
Processing metadata.
Information on file formats, dataset information, citation information, location data variables and ocean data variables is needed to reformat published datasets with BIO-MATE software. This information is called processing metadata herein and was manually entered and stored as comma delimited text files. The processing metadata required to run BIO-MATE software is described in the supplement Processing Metadata Table and differs for each data stream. All processing metadata used to construct the BIO-MATE aggregated data product is stored on GitHub (https://github.com/KimBaldry/BIO-MATE/tree/main/product_data/processing_metadata).

Dataset citation with BibTeX files.
Information is included in the BIO-MATE data product for citing published datasets, laboratory analysis methodologies (for the pigment and POC data streams) and the data repositories through which published data records were accessed. Each citation was recorded as a BibTeX entry, compatible with EndNote, R and LaTeX. Each BibTeX entry has a tag that is referenced in the processing metadata. This tag is used to link citations to their corresponding data records when datasets are ingested in the BIO-MATE software. Citation information is then printed in the header information in reformatted files. Where possible, BibTeX entries were sourced from data repositories. If BibTeX entries were not found, they were created manually.
Fig. 4 The spatiotemporal distribution of different data streams and bio-physical matches in the BIO-MATE data compilation.

All BibTeX entries are stored on GitHub (https://github.com/KimBaldry/BIO-MATE/product_data/supporting_information/citations) and in the BIO-MATE software (https://github.com/KimBaldry/BIOMATE-Rpackage/inst/citations). A look-up table is included in the BIO-MATE software to help users find relevant BibTeX entries needed to cite datasets appropriately (https://github.com/KimBaldry/BIOMATE-Rpackage/tree/main/data). A function export_ref supports the export of a smaller BibTeX file based on user selections of EXPOCODES and data streams that they have accessed through the product. This allows references to be easily appended to a bibliography as required.
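The idea of exporting a reduced bibliography from citation tags can be sketched in Python as follows. This is an illustration only and does not reproduce the export_ref function or the look-up table shipped with the BIO-MATE software. The function names, citation keys, entry contents and the structure of the look-up table are all hypothetical.

```python
import re


def index_bibtex(text):
    """Map each citation key to the full text of its '@type{key, ...}' entry."""
    # Entries are assumed to start with '@' at the beginning of a line.
    starts = list(re.finditer(r"(?m)^@\w+\s*\{\s*([^,\s]+)\s*,", text))
    entries = {}
    for i, match in enumerate(starts):
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        entries[match.group(1)] = text[match.start():end].strip()
    return entries


def export_subset(entries, lookup, selections):
    """Collect the entries cited by the selected (EXPOCODE, data stream) pairs."""
    tags = []
    for selection in selections:
        for tag in lookup[selection]:
            if tag not in tags:  # keep each citation once, in first-seen order
                tags.append(tag)
    return "\n\n".join(entries[tag] for tag in tags)


# Hypothetical BibTeX entries and look-up table, made up for illustration.
bib_text = """@misc{dataset_a,
  title = {Pigment data from voyage A},
  year = {2001}
}

@article{method_b,
  title = {A pigment analysis method},
  year = {1997}
}

@misc{dataset_c,
  title = {POC data from voyage C},
  year = {2005}
}
"""

lookup = {
    ("CRUISE_A", "PIG"): ["dataset_a", "method_b"],
    ("CRUISE_C", "POC"): ["dataset_c"],
}

subset = export_subset(index_bibtex(bib_text), lookup, [("CRUISE_A", "PIG")])
print(subset)
```

Keying the look-up on voyage and data stream means a user only needs to know what they accessed, not the individual citation tags.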
Reformatting and linking data streams with BIO-MATE R software.
The BIO-MATE R software was run to reformat data files to the WHP (CCHDO)-Exchange format (https://exchange-format.readthedocs.io/en/latest/index.html), using the original or split data files, processing metadata and citation information as input. The software arranges reformatted WHPE files into four data streams in local directories that include separate WHPE files for each EXPOCODE and for underway sensors, profiling sensor casts, pigment measurements and POC measurements.
Each data stream has its own reformatting function within the BIO-MATE R software (UWY_to_WHPE, PROF_to_WHPE, PIG_to_WHPE, POC_to_WHPE). The software requires physical (underway sensor and profiling sensor) data streams to be reformatted before biological (pigment and POC) data streams, to accommodate a biological-physical matching algorithm within the PIG_to_WHPE and POC_to_WHPE functions. The algorithm links biological data in the pigment and POC data streams to the physical data in the profiling sensor and underway sensor data streams. Biological data records are given a profiling sensor identification tag (CTD_ID) if matched to physical data in BIO-MATE.
To match biological data to physical data, the algorithm first uses EXPOCODES to find relevant physical data in the profiling sensor data stream. It then matches biological and physical data records by comparing station number (STNNBR) and cast number (CASTNO) records. If matches are detected using STNNBR and CASTNO, their validity is checked by comparing time and position information. If neither position nor time was recorded in the biological dataset (4157 pigment and 1948 POC records), the STNNBR and CASTNO records are assumed to be correct if they match between data streams (e.g. in the JGOFS records). If position or time was recorded, identified matches are checked to confirm that the biological and physical records were collected either within 24 hours or within 8 km of each other [507]. This quick check catches cases where STNNBR and CASTNO are used in similar ways within physical and biological sampling, but exact matches do not correspond to the same sampling event.
If matches could not be identified between datasets using STNNBR and CASTNO, a more rigorous search was performed using a database of time and position information from all profiling sensor data relating to the EXPOCODE. For biological data that contain position information, matches were found by identifying the closest profiling sensor record within 1 km in the database. If time information exists, matches are identified as the closest profiling sensor record within 6 hours; otherwise only matching date information is required. These position and time constraints are tighter than those applied when STNNBR and CASTNO records were matched. Using this process, all biological sampling events had matching physical sampling events. Matching has only been implemented with physical profiling sensor data and not with physical underway data. Underway surface data do not require station IDs and are more simply found with EXPOCODE and position and/or time.
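The two matching stages described above can be sketched in Python. This is a simplified reading of the text, not the algorithm as coded in PIG_to_WHPE and POC_to_WHPE. The record layout (dictionaries with `time`, `date`, `lat`, `lon` and `ctd_id` keys), the function names, the use of a haversine distance, and the way the 1 km and 6 hour limits are combined in the fallback search are all assumptions. The sample casts are made up.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two positions in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def confirm_station_match(bio, cast, max_hours=24.0, max_km=8.0):
    """Stage 1: sanity-check a match already found by station and cast number."""
    has_time = bio.get("time") is not None
    has_pos = bio.get("lat") is not None and bio.get("lon") is not None
    if not has_time and not has_pos:
        return True  # nothing to check against, so the station match is accepted
    near_in_time = has_time and abs(bio["time"] - cast["time"]) <= timedelta(hours=max_hours)
    near_in_space = has_pos and haversine_km(bio["lat"], bio["lon"], cast["lat"], cast["lon"]) <= max_km
    return near_in_time or near_in_space


def nearest_cast(bio, casts, max_km=1.0, max_hours=6.0):
    """Stage 2 (assumed reading): closest cast within max_km that also passes the time test."""
    best, best_km = None, None
    for cast in casts:
        km = haversine_km(bio["lat"], bio["lon"], cast["lat"], cast["lon"])
        if km > max_km:
            continue
        if bio.get("time") is not None:
            if abs(bio["time"] - cast["time"]) > timedelta(hours=max_hours):
                continue
        elif bio.get("date") is not None and bio["date"] != cast["time"].date():
            continue  # only a date is known, so require the same calendar day
        if best is None or km < best_km:
            best, best_km = cast, km
    return best


# Hypothetical casts from a single voyage.
casts = [
    {"ctd_id": "A_001_1", "time": datetime(2001, 1, 5, 10, 0), "lat": -60.0, "lon": 10.0},
    {"ctd_id": "A_002_1", "time": datetime(2001, 1, 6, 10, 0), "lat": -61.0, "lon": 10.0},
]
bio = {"time": datetime(2001, 1, 5, 11, 30), "lat": -60.002, "lon": 10.001}

match = nearest_cast(bio, casts)
print(match["ctd_id"] if match else "no match")
```

The looser limits in the first stage reflect that a station and cast number match is already strong evidence, so time and position only need to rule out gross mismatches.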
Quality assurance.
Limited quality assurance has been performed on the BIO-MATE data product and is variable across published datasets. As a supplement we include some insights into the quality of pigment data and chlorophyll fluorescence profiles, which were obtained through visual inspection (Supplementary QA/QC).
The initial integrity of these data records lies with the Principal Investigators of the published data record. As a result, reformatted data have varying levels of quality control and post-processing. We have included cruise report citations in our product to aid in further data quality assurance efforts. This allows a range of users to benefit from the BIO-MATE aggregate product and ensures data quality remains at the standard at which it was published. The quality assurance required of physical and biological ocean data varies according to application, and it is up to the user to confirm that the data are suitable for their application. Future versions of BIO-MATE could implement quality assurance metrics under community consensus. The data can now be easily ingested by other data synthesis efforts, like GLODAP [466,467] and the World Ocean Database (WOD), which implement established QA/QC protocols.

Data Records
The BIO-MATE data product [9] is stored on the IMAS data portal (https://data.imas.utas.edu.au) and available on the AODN (https://portal.aodn.org.au/), formatted as four data streams linked through unique EXPOCODES. Supporting data contains a metadata table and BibTeX citation files. The spatial extent of the data records is confined largely to the Southern Ocean, and the data were collected from 1985 to 2018 (Fig. 4). A summary of the data records in the BIO-MATE aggregate data product is presented in Tables 1-2.

Underway sensor data stream.
The underway sensor data stream contains a comma delimited WHP-Exchange file for each voyage ([EXPOCODE]_UWY.csv). The format of this file consists of headers to store metadata, followed by a data table that reports records collected by underway sensors mounted on the vessel (Data Records Table 1).

Profiling sensor data stream.
The profiling sensor data stream contains a comma delimited WHP-Exchange file for each unique profiling cast conducted on each voyage ([EXPOCODE][station number][cast number]_ctd1.csv). The file is formatted to store metadata as headers, followed by a data table that reports records from profiling sensors mounted on a sampling rosette (Data Records Table 2).
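Reading a file with this "headers, then data table" layout can be sketched in Python. The sample below is a simplified, hypothetical file modelled loosely on the WHP-Exchange CTD layout (a stamp line, `#` comment lines, `KEY = value` headers, a parameter row, a units row, data rows and an `END_DATA` marker). The exact headers and columns written by BIO-MATE may differ, and the function name and all sample values are made up.

```python
def parse_exchange_text(text):
    """Split a simplified exchange-style file into stamp, comments, headers and records."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    stamp, body = lines[0], lines[1:]
    comments, headers, table = [], {}, []
    for line in body:
        if line == "END_DATA":
            break
        if line.startswith("#"):
            comments.append(line.lstrip("# "))
        elif "=" in line and not table:
            # 'KEY = value' lines that appear before the data table are headers.
            key, value = line.split("=", 1)
            headers[key.strip()] = value.strip()
        else:
            table.append([cell.strip() for cell in line.split(",")])
    names, units, rows = table[0], table[1], table[2:]
    return {
        "stamp": stamp,
        "comments": comments,
        "headers": headers,
        "units": dict(zip(names, units)),
        "records": [dict(zip(names, row)) for row in rows],
    }


# Hypothetical file content written for illustration only.
sample = """CTD,20200101EXAMPLE
# Hypothetical profile, not taken from the data product
# Citation tags: dataset_a
NUMBER_HEADERS = 4
EXPOCODE = CRUISE_A
STNNBR = 12
CASTNO = 1
CTDPRS,CTDTMP,CTDSAL
DBAR,ITS-90,PSS-78
2.0,1.52,33.91
4.0,1.49,33.92
END_DATA
"""

profile = parse_exchange_text(sample)
print(profile["headers"]["EXPOCODE"], len(profile["records"]))
```

Values are kept as strings so that any flag or missing-value conventions in a real file can be handled explicitly before conversion to numbers.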

Pigment data stream.
The pigment data stream contains a comma delimited WHP-Exchange file for each voyage (named [EXPOCODE]PIG[SOURCED_FROM]_[METHOD].csv). The format of this file consists of headers to store supporting information, followed by a data table that records measurements from the laboratory analysis of seawater samples for pigments performed by principal investigators (Data Records Table 3). The laboratory analyses considered are fluorometric determination and high-performance liquid chromatography (HPLC).

Particulate organic carbon data stream.
The POC data stream contains a comma delimited WHP-Exchange file for each voyage (named [EXPOCODE]POC[SOURCED_FROM]_[METHOD].csv). The format of this file consists of headers to store supporting information, followed by a data table that records measurements from the lab analysis of seawater samples for particulate organic carbon performed by principal investigators (Data Records Table 4).

Supporting data.
Supporting data are included in the BIO-MATE aggregate data product to support the correct citation of data and guide user access to data. These data include (1) a BibTeX file that contains information to reference all BIO-MATE data records, (2) an index table indicating data availability and citation tags against data records listed by EXPOCODE, data stream, method and source, (3) a records table for all data repositories from which BIO-MATE data were sourced and (4) a records table for all pigment and POC analysis methods used in BIO-MATE data.

Fig. 7 A comparison of fluorometrically derived chlorophyll (FCHLORA) methods against total chlorophyll-a derived from HPLC measurements (TCHLA). The methods presented in this figure are ANTXVII_2 [491], JGOFS [490] and PALMER_LTER [496], which are widely used in the dataset.

Technical Validation
We validated the quality of the BIO-MATE data compilation by displaying a number of key data distributions and trends. This validation does not confirm the quality of individual data points, as the authors have applied no additional quality assurance to the published datasets.
The location data associated with the published datasets have been interpreted correctly by the software. This is evident from the success of the biophysical matching algorithm, along with the spatial distribution of the data and recorded sampling depths (Fig. 4). The data were predominantly collected in the month of January between 1991 and 2010. This is consistent with the fact that ship-based sampling in the Southern Ocean is conducted during Austral summer, and reflects a lag in publishing the most recent datasets to data repositories. All data are in the ocean, not on land, confirming the absence of spurious location data, and most samples are located in the Southern Ocean, which is consistent with our search constraints. Finally, information on the sampling time of ship-based biological data is as expected, and CTD sampling times (start, bottom and end) are sequential and follow a trend with sampling depth (Fig. 5).
The biological ocean data associated with the published datasets have been interpreted correctly by the software. Overall, fluorometrically derived chlorophyll (FCHLORA), HPLC derived chlorophyll a (Chl a) and HPLC derived total chlorophyll (TCHLA) measurements show a log-normal distribution, as expected. High values (>10 μg/l) are constrained to the coastal zones, as expected (Fig. 6).
There is a linear relationship between chlorophyll-a derived from HPLC methods and chlorophyll derived from fluorometric methods (Fig. 7). Five fluorometric methods to derive chlorophyll have coincident HPLC measurements. Briefly, the ANTXVIII [492] and JGOFS [491] methods show good correlation between the fluorometric and HPLC measurements. The PALMER_LTER [497] method shows considerable variability, which may be due to the coastal location of most samples and the influence of accessory pigments, but further investigation is needed. Only a small number of coincident HPLC measurements were collected alongside other fluorometric methods (<11), making it difficult to assess their quality.
Our validation plots show that fluorometric determination of chlorophyll tends to overestimate chlorophyll in the Southern Ocean. However, considerable variability is observed, as the overestimation or underestimation of chlorophyll-a by fluorometry is regionally dependent with changing phytoplankton assemblages [508][509][510]. A recent international intercomparison has also highlighted higher uncertainties, possibly due to low filtration volumes and different extraction and storage methods, and suggests new standards for these measurements [511]. Despite these uncertainties, visual inspection of the profiles shows that distributions of chlorophyll with depth are often well captured by fluorometric measurements.
The physical ocean data associated with published datasets have been interpreted correctly by the software. Temperature and salinity ranges fall within expected values for the ocean and display expected trends with latitude (Fig. 8).

Usage Notes
The community is welcome to contribute to the development of BIO-MATE software and to contribute published data to the aggregation, by following a user guide (Fig. 3).

Fig. 1 The BIO-MATE concept for creating a consistent data compilation from existing ship-based oceanographic data.

Fig. 2 Typical data collection and treatment process for biological oceanographic data within the BIO-MATE data compilation.

Fig. 3 A schematic demonstrating the BIO-MATE workflow.

Fig. 5 The time difference between the bottom depth (i.e. deepest position on the cast) and end depth (i.e. last sampling position) of a profiling sensor cast versus the bottom depth of the cast. Outliers with a bottom depth close to 0 likely represent shallow testing or calibration casts.

Fig. 6 (a) The distribution of chlorophyll-a derived from high-performance liquid chromatography (Chla), total chlorophyll-a derived from high-performance liquid chromatography (TCHLA) and chlorophyll-a derived from fluorometric determination (FCHLORA) in the BIO-MATE data compilation and (b) the location of high (>10 μg/l) Chla, TCHLA and FCHLORA measurements.

Fig. 8 The distribution of temperature and salinity data measured at 10 m by profiling sensors in the BIO-MATE data compilation.

Table 1. Summary of the pigment data records contained in the first version of BIO-MATE.

Table 2. Summary of the profiling sensor data contained within the first version of BIO-MATE.